    A hierarchical latent variable model for data visualization

    Visualization has proven to be a powerful and widely applicable tool for the analysis and interpretation of data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space, it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm which allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach first on a toy data set, and then apply the algorithm to the visualization of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines, and to data in 36 dimensions derived from satellite images.
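
    The paper's algorithm trains a hierarchical mixture of latent variable models with EM. As a rough analogue only, the sketch below substitutes scikit-learn's PCA for the top-level latent projection and a Gaussian mixture for the cluster decomposition; the data, component counts, and all names are illustrative placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # stand-in for the 12-D oil-flow data

# Top level: one 2-D view of the complete data set.
top_view = PCA(n_components=2).fit_transform(X)

# Second level: cluster in the data space, then give each cluster its own
# 2-D projection so sub-structure hidden in the top-level view can emerge.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)
sub_views = {
    k: PCA(n_components=2).fit_transform(X[labels == k]) for k in range(3)
}
```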

    Probabilistic principal component analysis

    Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.
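
    The closed-form maximum-likelihood solution the paper derives (and to which its EM algorithm converges) is compact enough to sketch: sigma^2 is the average of the discarded eigenvalues of the sample covariance, and the weight matrix spans the principal subspace. The function name and test data below are illustrative.

```python
import numpy as np

def ppca_ml(X, q):
    """Return ML estimates (W, sigma2, mu) of a q-dimensional PPCA model."""
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False)          # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)      # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()               # average of discarded eigenvalues
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, sigma2, mu

X = np.random.default_rng(1).normal(size=(200, 10))
W, sigma2, mu = ppca_ml(X, q=2)
```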

    Ecological indicators for abandoned mines, Phase 1: Review of the literature

    Mine waters have been identified as a significant issue in the majority of Environment Agency draft River Basin Management Plans. They are one of the largest drivers for chemical pollution in the draft Impact Assessment for the Water Framework Directive (WFD), with significant failures of environmental quality standards (EQS) for metals (particularly Cd, Pb, Zn, Cu, Fe) in many rivers linked to abandoned mines. Existing EQS may be overprotective of aquatic life that may have adapted over centuries of exposure. This study forms part of a larger project to investigate the ecological impact of metals in rivers and to develop water quality targets (alternative objectives for the WFD) for aquatic ecosystems impacted by long-term mining pollution. The report reviews the literature on EQS failures, the effects of metals on aquatic biota and the influence of water chemistry, and uses this information to identify further work. A preliminary assessment of water quality and biology data for 87 sites across Gwynedd and Ceredigion (Wales) shows that existing Environment Agency water quality and biology data could be used to establish statistical relationships between chemical variables and metrics of ecological quality. Visual representation and preliminary statistical analyses show that invertebrate diversity declines with increasing zinc concentration. However, the situation is more complex because the effects of other metals are not readily apparent. Furthermore, pH and aluminium also affect streamwater invertebrates, making it difficult to tease out toxicity due to individual mine-derived metals. The most characteristic feature of the plant communities of metal-impacted systems is a reduction in diversity compared to that found in comparable unimpacted streams. Some species thrive in the presence of heavy metals, presumably because they are able to develop metal tolerance, whilst others consistently disappear. Effects are, however, confounded by water chemistry, particularly pH. Tolerant species are spread across a number of divisions of photosynthetic organisms, though green algae, diatoms and blue-green algae are usually most abundant, often thriving in the absence of competition and/or grazing. Current UK monitoring techniques focus on community composition and, whilst these provide a sampling and analytical framework for studies of metal impacts, the metrics are not sensitive to these impacts. There is scope for developing new metrics, based on community-level analyses, and for examining the morphological variations common in some taxa at elevated metal concentrations. On the whole, community-based metrics are recommended, as these are easier to relate to ecological status definitions. With respect to invertebrates and fish, metals affect individuals, populations and communities, but sensitivity varies among species, life stages, sexes, trophic groups and with body condition. Acclimation or adaptation may cause varying sensitivity even within species. Ecosystem-scale effects, for example on ecological function, are poorly understood. Effects vary between metals: cadmium, copper, lead, chromium, zinc and nickel, in decreasing order of toxicity. Aluminium is important in acidified headwaters. Biological effects depend on speciation, toxicity, availability, mixtures, complexation and exposure conditions, for example discharge (flow).
    Current water quality monitoring is unlikely to detect short-term episodic increases in metal concentrations or to evaluate the bioavailability of elevated metal concentrations in sediments. These factors create uncertainty in detecting ecological impairment in metal-impacted ecosystems. Moreover, most widely used biological indicators for UK freshwaters were developed for other pressures, and none distinguishes metal impacts from other causes of impairment. Key ecological needs for better regulation and management of metals in rivers include: i) models relating metal data to ecological data that better represent influences on metal toxicity; ii) biodiagnostic indices to reflect metal effects; iii) better methods to identify metal acclimation or adaptation among sensitive taxa; iv) better investigative procedures to isolate metal effects from other pressures. Laboratory data on the effects of water chemistry on cationic metal toxicity and bioaccumulation show that a number of chemical parameters, particularly pH, dissolved organic carbon (DOC) and major cations (Na, Mg, K, Ca), exert a major influence on the toxicity and/or bioaccumulation of cationic metals. The biotic ligand model (BLM) provides a conceptual framework for understanding these water chemistry effects as a combination of the influence of chemical speciation and metal uptake by organisms in competition with H+ and other cations. In some cases where the BLM cannot describe effects, empirical bioavailability models have been used successfully. Laboratory data on the effects of metal mixtures across different water chemistries are sparse, with implications for transferring understanding to mining-impacted sites in the field, where mixture effects are likely. The available field data, although relatively sparse, indicate that water chemistry influences metal effects on aquatic ecosystems. First, this occurs through complexation reactions, notably involving dissolved organic matter and metals such as Al, Cu and Pb. Second, because bioaccumulation and toxicity are partly governed by complexation reactions, competition effects among metals, and between metals and H+, give rise to dependencies upon water chemistry. There is evidence that combinations of metals are active in the field; the main study conducted so far demonstrated the combined effects of Al and Zn, and suggested, less certainly, that Cu and H+ can also contribute. Chemical speciation is essential to interpret and predict observed effects in the field. Speciation results need to be combined with a model that relates free ion concentrations to toxic effect. Understanding the toxic effects of heavy metals derived from abandoned mines requires the simultaneous consideration of the acidity-related components Al and H+. There are a number of reasons why organisms in waters affected by abandoned mines may experience different levels of metal toxicity than in the laboratory. This could lead to discrepancies between actual field behaviour and that predicted by EQS derived from laboratory experiments, as would be applied within the WFD. The main factors to consider are adaptation/acclimation, water chemistry, and the effects of combinations of metals. Secondary factors are metals in food, metals supplied by sediments, and variability in stream flows. Two of the most prominent factors, namely adaptation/acclimation and bioavailability, could justify changes in EQS or the adoption of an alternative measure of toxic effects in the field.
    Given the prevalence of abandoned mines in England and Wales, and the high cost of remediating them to meet proposed WFD EQS criteria, further research into the question is clearly justified. Although ecological communities of mine-affected streamwaters might be over-protected by proposed WFD EQS, there are some conditions under which metals emanating from abandoned mines definitely exert toxic effects on biota. The main issue is therefore the reliable identification of chemical conditions that are unacceptable, and comparison of those conditions with those predicted by WFD EQS. If significant differences can convincingly be demonstrated, the argument could be made for alternative standards for waters affected by abandoned mines. Therefore, in our view, the immediate research priority is to improve the quantification of metal effects under field conditions. Demonstration of dose-response relationships, based on metal mixtures and their chemical speciation, and the use of better biological tools to detect and diagnose community-level impairment, would provide the necessary scientific information.
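
    One concrete form the proposed chemistry-to-ecology relationships could take is a regression of an invertebrate diversity metric on log zinc concentration. The sketch below uses synthetic placeholder data, not the Gwynedd/Ceredigion monitoring records, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# 87 synthetic sites, mirroring the number of sites in the review.
zinc_ugL = rng.lognormal(mean=3.0, sigma=1.0, size=87)
diversity = 20 - 3.0 * np.log10(zinc_ugL) + rng.normal(0, 1.5, size=87)

# Diversity regressed on log10 zinc concentration.
slope, intercept, r, p, se = stats.linregress(np.log10(zinc_ugL), diversity)
print(f"diversity ~ {intercept:.1f} + {slope:.1f}*log10(Zn), "
      f"r^2={r**2:.2f}, p={p:.1e}")
```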

    Simulation of carbon cycling, including dissolved organic carbon transport, in forest soil locally enriched with 14C

    The DyDOC model was used to simulate the soil carbon cycle of a deciduous forest at the Oak Ridge Reservation (Tennessee, USA). The model application relied on extensive data from the Enriched Background Isotope Study (EBIS), which exploited a short-term local atmospheric enrichment of radiocarbon to establish a large-scale manipulation experiment with different inputs of 14C from both above-ground and below-ground litter. The model was first fitted to hydrological data; observed pools and fluxes of carbon and 14C were then used to fit parameters describing metabolic transformations of soil organic matter (SOM) components and the transport and sorption of dissolved organic matter (DOM). This produced a detailed quantitative description of soil C cycling in the three horizons (O, A, B) of the soil profile. According to the parameterised model, SOM turnover within the thin O-horizon rapidly produces DOM (46 gC m-2 a-1), which is predominantly hydrophobic. This DOM is nearly all adsorbed in the A- and B-horizons, and while most is mineralised relatively quickly, 11 gC m-2 a-1 undergoes a “maturing” reaction, producing mineral-associated stable SOM pools with mean residence times of 100-200 years. Only a small flux (~ 1 gC m-2 a-1) of hydrophilic DOM leaves the B-horizon. The SOM not associated with mineral matter is assumed to be derived from root litter, and turns over quite quickly (mean residence time 20-30 years). Although DyDOC was successfully fitted to C pools, annual fluxes and 14C data, it accounted less well for short-term variations in DOC concentrations.
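
    The reported mean residence times correspond to first-order turnover: a pool receiving a constant input flux F and turning over with residence time tau approaches a steady-state stock of F x tau. The sketch below integrates that single-pool balance using the round numbers quoted above; it is a toy illustration, not DyDOC itself, which additionally tracks DOM sorption, transport and 14C.

```python
def simulate_pool(input_flux, tau, years=1000, dt=1.0, c0=0.0):
    """Integrate dC/dt = input_flux - C/tau with forward Euler steps."""
    c = c0
    for _ in range(int(years / dt)):
        c += dt * (input_flux - c / tau)
    return c

# Mineral-associated stable SOM: 11 gC m-2 a-1 input, MRT ~150 a (midpoint
# of the 100-200 year range). Steady state approaches 11 * 150 = 1650 gC m-2.
print(f"stable SOM stock ≈ {simulate_pool(11.0, 150.0):.0f} gC m-2")
```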

    Fatigue Life Prediction Using Hybrid Prognosis for Structural Health Monitoring


    Bayesian Regression and Classification


    Active Sampling-based Binary Verification of Dynamical Systems

    Nonlinear, adaptive, or otherwise complex control techniques are increasingly relied upon to ensure the safety of systems operating in uncertain environments. However, the nonlinearity of the resulting closed-loop system complicates verification that the system does in fact satisfy its safety requirements at all possible operating conditions. While analytical proof-based techniques and finite abstractions can be used to provably verify the closed-loop system's response at different operating conditions, they often produce conservative approximations due to restrictive assumptions and are difficult to construct in many applications. In contrast, popular statistical verification techniques relax the restrictions and instead rely upon simulations to construct statistical or probabilistic guarantees. This work presents a data-driven statistical verification procedure that instead constructs statistical learning models from simulated training data to separate the set of possible perturbations into "safe" and "unsafe" subsets. Binary evaluations of closed-loop system requirement satisfaction at various realizations of the uncertainties are obtained through temporal logic robustness metrics, which are then used to construct predictive models of requirement satisfaction over the full set of possible uncertainties. As the accuracy of these predictive statistical models is inherently coupled to the quality of the training data, an active learning algorithm selects additional sample points in order to maximize the expected change in the data-driven model and thus, indirectly, minimize the prediction error. Various case studies demonstrate the closed-loop verification procedure and highlight improvements in prediction error over both existing analytical and statistical verification techniques.
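
    A minimal sketch of the procedure's shape: label simulated operating conditions safe or unsafe, fit a probabilistic classifier, and query the candidate the model is least certain about. The toy simulator, the Gaussian-process classifier, and the uncertainty-sampling rule below are illustrative stand-ins; the paper itself uses temporal-logic robustness values and an expected-model-change selection criterion.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(3)

def simulate_safe(theta):
    """Placeholder closed-loop simulation: 'safe' iff the perturbation is small."""
    return int(np.linalg.norm(theta) < 1.0)

# Seed conditions include one known-safe and one known-unsafe point so both
# classes are present from the start.
Theta = np.vstack([[0.0, 0.0], [1.8, 1.8], rng.uniform(-2, 2, size=(18, 2))])
y = np.array([simulate_safe(t) for t in Theta])

pool = rng.uniform(-2, 2, size=(2000, 2))        # candidate perturbations
for _ in range(30):                              # active-learning loop
    clf = GaussianProcessClassifier().fit(Theta, y)
    p_safe = clf.predict_proba(pool)[:, 1]
    i = int(np.argmin(np.abs(p_safe - 0.5)))     # most uncertain candidate
    Theta = np.vstack([Theta, pool[i]])
    y = np.append(y, simulate_safe(pool[i]))
    pool = np.delete(pool, i, axis=0)            # do not re-query the same point
```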

    Statistical Mechanical Development of a Sparse Bayesian Classifier

    The demand for extracting rules from high-dimensional real-world data is increasing in various fields. However, the possible redundancy of such data sometimes makes it difficult to obtain good generalization to novel samples. To resolve this problem, we provide a scheme that reduces the effective dimensionality of the data by pruning redundant components for bicategorical classification within the Bayesian framework. First, the potential of the proposed method is confirmed in ideal situations using the replica method. Unfortunately, performing the scheme exactly is computationally difficult, so we next develop a tractable approximation algorithm, which turns out to offer nearly optimal performance in ideal cases when the system size is large. Finally, the efficacy of the developed classifier is experimentally examined on a real-world problem of colon cancer classification, which shows that the developed method can be practically useful.
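
    The pruning idea can be illustrated with a deliberately simpler stand-in: an L1-penalised logistic regression, which also zeroes out redundant components of high-dimensional bicategorical data. This is a plain substitution, not the paper's Bayesian scheme or its replica-method analysis, and the synthetic data below are not the colon cancer set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, d, d_relevant = 100, 500, 10                # few samples, many dimensions
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:d_relevant] = 2.0                      # only 10 components matter
y = (X @ w_true + rng.normal(size=n) > 0).astype(int)

# L1 penalty drives redundant coefficients exactly to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])
print(f"kept {kept.size} of {d} components")   # most redundant ones pruned
```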

    BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

    The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during the initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., the full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.
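
    The core idea can be sketched as follows (this is an illustration of the statistical reasoning, not BlinkML's actual API): train a near-unpenalised logistic model on a sample, approximate its parameter uncertainty relative to the full-data model via the inverse observed Fisher information with a finite-population correction, and Monte-Carlo estimate how often predictions would agree. All names and constants below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
N, n, d = 100_000, 2_000, 5                   # full size, sample size, features
X = rng.normal(size=(N, d))
y = (X @ rng.normal(size=d) + rng.normal(size=N) > 0).astype(int)

idx = rng.choice(N, size=n, replace=False)
clf = LogisticRegression(C=1e6, fit_intercept=False).fit(X[idx], y[idx])
theta = clf.coef_[0]                          # approximate MLE on the sample

# Observed Fisher information of the sampled logistic likelihood.
p = clf.predict_proba(X[idx])[:, 1]
H = (X[idx] * (p * (1 - p))[:, None]).T @ X[idx]
cov = np.linalg.inv(H) * (1 - n / N)          # uncertainty vs the full model

# Monte-Carlo estimate of how often sample-model predictions match a model
# whose parameters are drawn from that uncertainty (a proxy for agreement).
draws = rng.multivariate_normal(theta, cov, size=200)
signs = np.sign(X[:1000] @ theta)[:, None]
agree = np.mean(np.sign(X[:1000] @ draws.T) == signs)
print(f"estimated prediction agreement ≈ {agree:.3f}")
```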

    The interaction between bovine serum albumin and surfactants
